Abstract

Boosting is a popular way to derive power- 
ful learners from simpler hypothesis classes. 
Following previous work (Mason et al., 1999; 
Friedman, 2000) on general boosting frame- 
works, we analyze gradient-based descent al- 
gorithms for boosting with respect to any 
convex objective and introduce a new mea- 
sure of weak learner performance into this 
setting which generalizes existing work. We 
present the weak to strong learning guaran- 
tees for the existing gradient boosting work 
for strongly-smooth, strongly-convex objec- 
tives under this new measure of performance, 
and also demonstrate that this work fails for 
non-smooth objectives. To address this is- 
sue, we present new algorithms which extend 
this boosting approach to arbitrary convex 
loss functions and give corresponding weak to 
strong convergence results. In addition, we 
demonstrate experimental results that sup- 
port our analysis and demonstrate the need 
for the new algorithms we present. 

1. Introduction 

Boosting (Schapire, 2002) is a versatile meta-algorithm 
for combining together multiple simple hypotheses, or 
weak learners, to form a single complex hypothesis 
with superior performance. The power of this meta- 
algorithm lies in its ability to craft hypotheses which 
can achieve arbitrary performance on training data us- 
ing only weak learners that perform marginally better 
than random. This weak to strong learning guarantee 
is a critical feature of boosting. 

To date, much of the work on boosting has focused 
on optimizing the performance of this meta-algorithm 
with respect to specific loss functions and problem settings.
The AdaBoost algorithm (Freund & Schapire,
1997) is perhaps the most well known and most suc- 
cessful of these. AdaBoost focuses specifically on the 
task of classification via the minimization of the ex- 
ponential loss by boosting weak binary classifiers to- 
gether, and can be shown to be near optimal in this 
setting. Looking to extend upon the success of Ad- 
aBoost, related algorithms have been developed for 
other domains, such as RankBoost (Freund et al., 
2003) and multiclass extensions to AdaBoost (Mukherjee
& Schapire, 2010). Each of these algorithms provides
both strong theoretical and experimental results
for its specific domain, including corresponding
weak to strong learning guarantees, but extending
boosting to these and other new settings is non-trivial. 

Recent attempts have been successful at generalizing 
the boosting approach to certain broader classes of 
problems, but their focus is also relatively restricted. 
Mukherjee and Schapire (2010) present a general theory
of boosting for multiclass classification problems,
but their analysis is restricted to the multiclass setting. 
Zheng et al. (2007) give a boosting method which 
utilizes the second-order Taylor approximation of the 
objective to optimize smooth, convex losses. Unfortu- 
nately, the corresponding convergence result for their 
algorithm does not exhibit the typical weak to strong 
guarantee seen in boosting analyses and their results 
apply only to weak learners which solve the weighted 
squared regression problem. 

Other previous work on providing general algorithms 
for boosting has shown that an intuitive link between 
algorithms like AdaBoost and gradient descent exists 
(Mason et al., 1999; Friedman, 2000), and that many 
existing boosting algorithms can be reformulated to 
fit within this gradient boosting framework. Under 
this view, boosting algorithms are seen as performing 
a modified gradient descent through the space of all 
hypotheses, where the gradient is calculated and then 
used to find the weak learner which will provide the 
best descent direction. 

In the case of smooth convex functionals, Mason et al.
(1999) give a proof of eventual convergence for this
previous work, but no rates of convergence are given.
Additionally, convergence rates of these algorithms have
been analyzed for the case of smooth convex functionals
(Rätsch et al., 2002) and for specific potential functions
used in classification (Duffy & Helmbold, 2000)
under the traditional PAC weak learning setting.

Our work aims to rigorously define the mathematics 
underlying this connection and show how standard 
boosting notions such as that of weak learner perfor- 
mance can be extended to the general case. Using this 
foundation, we will present weak to strong learning re- 
sults for the existing gradient boosting algorithm (Ma- 
son et al., 1999; Friedman, 2000) for the special case 
of smooth convex objectives under our more general 
setting. 

Furthermore, we will also demonstrate that this ex- 
isting algorithm can fail to converge on non-smooth 
objectives, even in finite dimensions. To rectify this 
issue, we present new algorithms which do have cor- 
responding strong convergence guarantees for all con- 
vex objectives, and demonstrate experimentally that 
these new algorithms often outperform the existing al- 
gorithm in practice. 

Our analysis is modeled after existing work on gradient 
descent algorithms for optimizing over vector spaces. 
For convex problems standard gradient descent algo- 
rithms are known to provide good convergence results 
(Zinkevich, 2003; Boyd & Vandenberghe, 2004; Hazan 
et al., 2006) and are widely applicable. However, as de- 
tailed above, the modified gradient descent procedure 
which corresponds to boosting does not directly follow 
the gradient, instead selecting a descent direction from 
a restricted set of allowable search directions. This 
restricted gradient descent procedure requires new ex- 
tensions to the previous work on gradient descent op- 
timization algorithms. 

A related form of gradient descent with gradient errors 
has previously been studied in the analysis of budgeted 
learning (Sutskever, 2009), and general results related 
to gradient projection errors are given in the litera- 
ture. While these results apply to the boosting setting, 
they lack any kind of weak to strong guarantee. Con- 
versely, we are primarily interested in studying what 
algorithms and assumptions are needed to overcome 
projection error and achieve strong final performance 
even in the face of mediocre weak learner performance. 

The rest of the paper is organized as follows. We first
explicitly detail the Hilbert space of functions and various
operations within this Hilbert space. Then, we discuss
how to quantify the performance of a weak learner in
terms of this vector space. Following that, we present
theoretical weak to strong learning guarantees for both
the existing and our new algorithms. Finally, we provide
experimental results comparing all algorithms discussed
on a variety of tasks.

2. $L^2$ Function Space

Previous work (Mason et al., 1999; Friedman, 2000) 
has presented the theory underlying function space 
gradient descent in a variety of ways, but never in a 
form which is convenient for convergence analysis. Recently,
Ratliff (2009) proposed the $L^2$ function space as
a natural match for this setting. This representation 
as a vector space is particularly convenient as it dove- 
tails nicely with the analysis of gradient descent based 
algorithms. We will present here the Hilbert space of 
functions most relevant to functional gradient boost- 
ing, but the later convergence analysis for restricted 
gradient descent algorithms can be generalized to any 
Hilbert space. 

Given a measurable input set $\mathcal{X}$, an output vector
space $\mathcal{V}$, and measure $\mu$, the function space
$L^2(\mathcal{X}, \mathcal{V}, \mu)$ is the set of all equivalence
classes of functions $f : \mathcal{X} \to \mathcal{V}$ such that
the Lebesgue integral

$$\int_{\mathcal{X}} \|f(x)\|_{\mathcal{V}}^2 \, d\mu \qquad (1)$$

is finite. We will specifically consider the special case
where $\mu$ is a probability measure $P$ with density function
$p(x)$, so that (1) is equivalent to $E_P[\|f(x)\|^2]$.

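This Hilbert space has a natural inner product and
norm:

$$\langle f, g \rangle_P = \int_{\mathcal{X}} \langle f(x), g(x) \rangle_{\mathcal{V}} \, p(x) \, dx = E_P[\langle f(x), g(x) \rangle_{\mathcal{V}}]$$

$$\|f\|_P^2 = \langle f, f \rangle_P = E_P[\|f(x)\|_{\mathcal{V}}^2].$$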

We parameterize these operations by $P$ to denote their
reliance on the underlying data distribution. In the
case of the empirical probability distribution $\hat{P}$ these
quantities are simply the corresponding empirical expected
value. For example, the inner product becomes

$$\langle f, g \rangle_{\hat{P}} = \frac{1}{N} \sum_{n=1}^{N} \langle f(x_n), g(x_n) \rangle_{\mathcal{V}}.$$
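As a minimal sketch of the empirical inner product and norm, the snippet below represents a function by the list of its outputs on the $N$ training points. The helper names `inner` and `norm` and the sample values are illustrative assumptions, not part of the original text.

```python
import math

def inner(f_vals, g_vals):
    """Empirical inner product: (1/N) * sum_n <f(x_n), g(x_n)>_V."""
    n = len(f_vals)
    return sum(
        sum(a * b for a, b in zip(fv, gv))
        for fv, gv in zip(f_vals, g_vals)
    ) / n

def norm(f_vals):
    """Empirical norm induced by the inner product above."""
    return math.sqrt(inner(f_vals, f_vals))

# Outputs of two hypothetical functions on N = 3 points, with V = R^2.
f_vals = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
g_vals = [[2.0, 1.0], [1.0, 1.0], [0.0, 3.0]]

ip = inner(f_vals, g_vals)        # (2 + 2 + 3) / 3
sq_norm = inner(f_vals, f_vals)   # (1 + 4 + 2) / 3
```

Representing functions by their values on the training set is enough here because the empirical measure only places mass on those points.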

In order to perform gradient descent over such a space, 
we need to compute the gradient of functionals over 
said space. We will use the standard definition of a
subgradient to allow for optimization of non-smooth
functions. Define $\nabla\mathcal{R}[f]$ to be a subgradient iff,
for all $g$:

$$\mathcal{R}[g] \ge \mathcal{R}[f] + \langle g - f, \nabla\mathcal{R}[f] \rangle_P$$

Here $\nabla\mathcal{R}[f]$ is a (function space) subgradient of the
functional $\mathcal{R} : L^2(\mathcal{X}, \mathcal{V}, P) \to \mathbb{R}$ at $f$.
Using this definition, these subgradients are straightforward
to compute for a number of functionals.

For example, for the point-wise loss over a set of training
examples,

$$\mathcal{R}_{emp}[f] = \frac{1}{N} \sum_{n=1}^{N} l(f(x_n), y_n)$$

the subgradients in $L^2(\mathcal{X}, \mathcal{V}, \hat{P})$ are the set:

$$\nabla\mathcal{R}_{emp}[f] = \{ g \mid g(x_n) \in \nabla l(f(x_n), y_n) \}$$

where $\nabla l(f(x_n), y_n)$ is the set of subgradients of
the pointwise loss $l$ with respect to $f(x_n)$. For
differentiable $l$, this is just the partial derivative of $l$ with
respect to input $f(x_n)$.

Similarly the expected loss,

$$\mathcal{R}[f] = E_P[E_y[l(f(x), y)]],$$

has the following subgradients in $L^2(\mathcal{X}, \mathcal{V}, P)$:

$$\nabla\mathcal{R}[f] = \{ g \mid g(x) \in E_y[\nabla l(f(x), y)] \}.$$
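To make the subgradient of the empirical loss concrete, here is a small sketch for the scalar hinge loss $l(f(x), y) = \max(0, 1 - y f(x))$. The choice of hinge loss, the function names, and the sample values are assumptions made for illustration; the code also checks the subgradient inequality under the empirical inner product for one alternative function.

```python
def hinge(fx, y):
    """Pointwise hinge loss max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)

def hinge_subgrad(fx, y):
    """One subgradient w.r.t. f(x): -y if the margin is violated, else 0.
    At the kink (y * f(x) == 1) the value 0 is a valid choice."""
    return -y if y * fx < 1.0 else 0

def emp_risk(f_vals, ys):
    return sum(hinge(fx, y) for fx, y in zip(f_vals, ys)) / len(ys)

def emp_subgradient(f_vals, ys):
    """Values g(x_n) of a function-space subgradient on the training points."""
    return [hinge_subgrad(fx, y) for fx, y in zip(f_vals, ys)]

f_vals = [0.5, -2.0, 3.0, 0.0]   # current function on N = 4 points
ys = [1, 1, -1, -1]
grad = emp_subgradient(f_vals, ys)

# Subgradient inequality: R[g] >= R[f] + <g - f, grad> (empirical inner product).
g_vals = [2.0, 0.0, -1.0, -3.0]
linear = sum((gv - fv) * d for gv, fv, d in zip(g_vals, f_vals, grad)) / len(ys)
lower_bound = emp_risk(f_vals, ys) + linear
```

Note that no $\frac{1}{N}$ factor appears in `grad` itself: it is absorbed by the empirical inner product.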

3. Restricted Gradient Descent 

We now outline the gradient-based view of boosting 
(Mason et al., 1999; Friedman, 2000) and how it re- 
lates to gradient descent. In contrast to the standard 
gradient descent algorithm, boosting is equivalent to 
what we will call the restricted gradient descent set- 
ting, where the gradient is not followed directly, but 
is instead replaced by another search direction from 
a set of allowable descent directions. We will refer to 
this set of allowable directions as the restriction set. 

From a practical standpoint, a projection step is neces- 
sary when optimizing over function space because the 
functions representing the gradient directly are com- 
putationally difficult to manipulate and do not gener- 
alize to new inputs well. In terms of the connection 
to boosting, the restriction set corresponds directly to 
the set of hypotheses generated by a weak learner. 

We are primarily interested in two aspects of this re- 
stricted gradient setting: first, appropriate ways to 
find the best allowable direction of descent, and second,
a means of quantifying the performance of a restriction
set. Conveniently, the function space view of
boosting provides a simple geometric explanation for
these concerns.

Algorithm 1 Naive Gradient Projection Algorithm
Given: starting point $f_0$, step size schedule $\{\eta_t\}_{t=1}^{T}$
for $t = 1, \ldots, T$ do
    Compute subgradient $\nabla_t \in \nabla\mathcal{R}[f_{t-1}]$.
    Project $\nabla_t$ onto hypothesis space $\mathcal{H}$, finding
    nearest direction $h^*$.
    Update $f$: $f_t \leftarrow f_{t-1} - \eta_t \frac{\langle h^*, \nabla_t \rangle}{\|h^*\|^2} h^*$.
end for

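Given a gradient $\nabla$ and candidate direction $h$, the
closest point $h'$ along $h$ can be found using vector
projection:

$$h' = \frac{\langle \nabla, h \rangle}{\|h\|^2} h \qquad (2)$$

Now, given a set of possible descent directions $\mathcal{H}$ the
vector $h^*$ which minimizes the resulting projection error
(2) also maximizes the projected length:

$$h^* = \arg\max_{h \in \mathcal{H}} \frac{\langle \nabla, h \rangle}{\|h\|}. \qquad (3)$$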

This is a generalization of the projection operation in 
Mason et al. (1999) to functions other than classifiers. 

For the special case when $\mathcal{H}$ is closed under scalar
multiplication, one can instead find $h^*$ by directly
minimizing the distance between $\nabla$ and $h^*$,

$$h^* = \arg\min_{h \in \mathcal{H}} \|\nabla - h\|^2 \qquad (4)$$

thereby reducing the final projected distance found using
(2). This projection operation is equivalent to the
one given by Friedman (2000).

These two projection methods provide relatively sim- 
ple ways to search over any restriction set for the 'best' 
descent direction. The straightforward algorithm (Mason
et al., 1999; Friedman, 2000) for performing restricted
gradient descent which uses these projection
operations is given in Algorithm 1. 
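The following is a minimal sketch of Algorithm 1 for a finite restriction set, with functions represented by their values on the training points and the projection taken as in (3). The function names, the squared-loss toy objective, and the signed-indicator hypothesis set are assumptions chosen for illustration only.

```python
import math

def inner(u, v):
    # Empirical inner product over the N training points.
    return sum(a * b for a, b in zip(u, v)) / len(u)

def project(grad, hypotheses):
    # Projection (3): maximize <grad, h> / ||h|| over the restriction set.
    return max(hypotheses,
               key=lambda h: inner(grad, h) / math.sqrt(inner(h, h)))

def naive_gradient_projection(subgrad, f0, hypotheses, step_sizes):
    f = list(f0)
    for eta in step_sizes:
        g = subgrad(f)                          # subgradient at current f
        h = project(g, hypotheses)              # best allowable direction
        coef = eta * inner(h, g) / inner(h, h)  # step along h
        f = [fi - coef * hi for fi, hi in zip(f, h)]
    return f

# Toy objective (an assumption): R[f] = (1/N) sum_n 0.5 * (f_n - y_n)^2.
y = [1.0, -2.0, 0.5]
squared_loss_grad = lambda f: [fi - yi for fi, yi in zip(f, y)]

# Restriction set: signed indicator functions of each training point.
H = [[s if i == j else 0.0 for j in range(3)]
     for i in range(3) for s in (1.0, -1.0)]

f_final = naive_gradient_projection(squared_loss_grad, [0.0] * 3, H, [1.0] * 3)
```

With unit step size each iteration removes the largest remaining residual coordinate, so three steps suffice to fit the three targets in this toy problem.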

In order to analyze the restricted gradient descent
algorithms, we need a way to quantify the relative strength
of a given restriction set. A guarantee on the performance
of each projection step, typically referred to
in the traditional boosting literature as the edge of a
given weak learner, is crucial to the convergence analysis
of restricted gradient algorithms.

For the projection which maximizes the inner product
as in (3), we can use the generalized geometric notion
of angle to bound performance by requiring that

$$\langle \nabla, h \rangle \ge \cos\theta \, \|\nabla\| \|h\|$$

while the equivalent requirement for the norm-based
projection in (4) is

$$\|\nabla - h\|^2 \le (1 - (\cos\theta)^2) \|\nabla\|^2.$$

Parameterizing by $\cos\theta$, we can now concisely define
the performance potential of a restricted set of search
directions, which will prove useful in later analysis.

Definition 1. A restriction set $\mathcal{H}$ has edge $\gamma$ if for
every projected gradient $\nabla$ there exists a vector $h \in \mathcal{H}$
such that either $\langle \nabla, h \rangle \ge \gamma \|\nabla\| \|h\|$ or
$\|\nabla - h\|^2 \le (1 - \gamma^2) \|\nabla\|^2$.

This definition of edge is parameterized by $\gamma \in [0, 1]$,
with larger values of edge corresponding to lower projection
error and faster algorithm convergence.
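As an illustrative sketch, the snippet below computes the cosine achieved by the best direction in a small restriction set for one gradient, and checks that the two conditions in Definition 1 agree once the chosen direction is rescaled as in (2). The gradient values and the coordinate-direction hypothesis set are assumptions.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v)) / len(u)

def norm(u):
    return math.sqrt(inner(u, u))

def cosine(grad, h):
    # Generalized cosine of the angle between grad and h.
    return inner(grad, h) / (norm(grad) * norm(h))

def best_direction(grad, hypotheses):
    return max(hypotheses, key=lambda h: cosine(grad, h))

grad = [3.0, 4.0]
H = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

h = best_direction(grad, H)
cos_theta = cosine(grad, h)

# Rescale h by the projection coefficient and measure the squared error.
scale = inner(grad, h) / inner(h, h)
residual = [g - scale * hi for g, hi in zip(grad, h)]
lhs = inner(residual, residual)                    # ||grad - h'||^2
rhs = (1.0 - cos_theta ** 2) * inner(grad, grad)   # (1 - cos^2) ||grad||^2
```

The edge of a restriction set is then the smallest such cosine over all gradients the algorithm may encounter; this example only evaluates a single gradient.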

3.1. Relationship to Previous Boosting Work 

Though these projection operations apply to any $L^2$
hypothesis set, they also have convenient interpretations
when it comes to specific function classes traditionally
used as weak learners in boosting.

For a classification-based weak learner with outputs in
$\{-1, +1\}$ and an optimization over single output functions
$f : \mathcal{X} \to \mathbb{R}$, projecting as in (3) is equivalent to
solving the weighted classification problem over examples
$\{x_n, \mathrm{sgn}(\nabla(x_n))\}_{n=1}^{N}$ and weights $w_n = |\nabla(x_n)|$.

The projection via norm minimization in (4) is equivalent
to solving the regression problem

$$h^* = \arg\min_{h \in \mathcal{H}} \frac{1}{N} \sum_{n=1}^{N} \|\nabla(x_n) - h(x_n)\|^2$$

using the gradient outputs as regression targets. 
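The equivalence between projection (3) and weighted classification can be checked numerically. In the sketch below, the decision stumps, thresholds, inputs, and gradient values are all hypothetical; the code selects a stump once by maximizing the inner product with the gradient and once by minimizing weighted classification error with labels $\mathrm{sgn}(\nabla(x_n))$ and weights $|\nabla(x_n)|$.

```python
xs = [0.1, 0.4, 0.5, 0.9]            # 1-D training inputs
grad = [-2.0, 0.5, 1.5, 1.0]         # gradient values at the inputs

labels = [1.0 if g >= 0 else -1.0 for g in grad]
weights = [abs(g) for g in grad]

def stump(theta, s):
    # Classifier with outputs in {-1, +1}: s if x > theta, else -s.
    return lambda x: s if x > theta else -s

thresholds = [0.0, 0.25, 0.45, 0.7]
candidates = [(theta, s) for theta in thresholds for s in (1, -1)]

def inner_with_grad(params):
    h = stump(*params)
    return sum(g * h(x) for g, x in zip(grad, xs)) / len(xs)

def weighted_error(params):
    h = stump(*params)
    return sum(w for w, y, x in zip(weights, labels, xs) if h(x) != y)

best_by_inner = max(candidates, key=inner_with_grad)
best_by_error = min(candidates, key=weighted_error)

# Identity behind the equivalence: <grad, h> = (sum_n w_n - 2 * error) / N.
identity_gap = max(
    abs(inner_with_grad(c) - (sum(weights) - 2 * weighted_error(c)) / len(xs))
    for c in candidates
)
```

Because every such classifier has unit norm, the normalization in (3) does not affect the choice.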

Similarly, our notion of weak learner performance in
Definition 1 can be related to previous work. Like our
measure of edge, which quantifies performance over the
trivial hypothesis $h(x) = 0 \; \forall x$, previous work has used
similar quantities which capture the advantage over
baseline hypotheses.

For weak learners which are binary classifiers, as is the 
case in AdaBoost (Freund & Schapire, 1997), there is 
an equivalent notion of edge which refers to the im- 
provement in performance over predicting randomly. 
We can show that Definition 1 is an equivalent mea- 
sure: 

Theorem 1. For a weak classifier space $\mathcal{H}$ with outputs
in $\{-1, +1\}$, the following statements are equivalent:
(1) $\mathcal{H}$ has edge $\gamma$ for some $\gamma > 0$, and (2) for any
non-negative weights $w_n$ over training data $x_n$, there
is a classifier $h \in \mathcal{H}$ which achieves an error of at most
$(\frac{1}{2} - \frac{\delta}{2}) \sum_n w_n$ for some $\delta > 0$.

A similar result can be shown for more recent work on
multiclass weak learners (Mukherjee & Schapire, 2010)
when optimizing over functions with multiple outputs
$f : \mathcal{X} \to \mathbb{R}^K$:

Theorem 2. For a weak multiclass classifier space $\mathcal{H}$
with outputs in $\{1, \ldots, K\}$, let the modified hypothesis
space $\mathcal{H}'$ contain a hypothesis $h' : \mathcal{X} \to \mathbb{R}^K$ for
each $h \in \mathcal{H}$ such that $h'(x)_k = 1$ if $h(x) = k$ and
$h'(x)_k = -\frac{1}{K-1}$ otherwise. Then, the following statements
are equivalent: (1) $\mathcal{H}'$ has edge $\gamma$ for some
$\gamma > 0$, and (2) $\mathcal{H}$ satisfies the performance over baseline
requirements detailed in Theorem 1 of (Mukherjee
& Schapire, 2010).

Proofs and more details on these equivalences can be 
found in Appendix A. 

4. Convergence Analysis 

We now focus on analyzing the behavior of variants of 
the basic restricted gradient descent algorithm shown 
in Algorithm 1 on problems of the form: 

$$\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f],$$

where allowable descent directions are taken from some
restriction set $\mathcal{H} \subset \mathcal{F}$.

In line with previous boosting work, we will specifically
consider cases where the edge requirement in Definition 1
is met for some $\gamma$, and seek convergence results
where the empirical objective $\mathcal{R}_{emp}[f_t]$ approaches the
optimal training performance $\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. This
work does not attempt to analyze the convergence of
the true risk, $\mathcal{R}[f]$.

While we consider $L^2$ function space specifically, the
convergence analysis presented can be extended to optimization
over any Hilbert space using restricted gradient
descent.

4.1. Smooth Convex Optimization 

An earlier result showing $O((1 - \frac{1}{C})^T)$ convergence of
the objective to optimality for smooth functionals is
given by Rätsch et al. (2002) using
results from the optimization literature on coordinate
descent. Alternatively, this gives a $O(\log(\frac{1}{\epsilon}))$ result
for the number of iterations required to achieve error
$\epsilon$. Similar to our result, this work relies on the smoothness
of the objective as well as the weak learner performance,
but uses the more restrictive notion of edge
from previous boosting literature specifically tailored
to PAC weak learners (classifiers). This previous result
also has an additional dependence on the number
of weak learners and number of training examples.

We will now give a generalization of the result in
(Rätsch et al., 2002) which uses our more general definition
of weak learner edge. The convergence analysis
of Algorithm 1 relies on two critical properties of the
objective functional $\mathcal{R}$.

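A functional $\mathcal{R}$ is $\lambda$-strongly convex if $\forall f, f' \in \mathcal{F}$:

$$\mathcal{R}[f'] \ge \mathcal{R}[f] + \langle \nabla\mathcal{R}[f], f' - f \rangle + \frac{\lambda}{2} \|f' - f\|^2$$

for some $\lambda > 0$, and $\Lambda$-strongly smooth if

$$\mathcal{R}[f'] \le \mathcal{R}[f] + \langle \nabla\mathcal{R}[f], f' - f \rangle + \frac{\Lambda}{2} \|f' - f\|^2$$

for some $\Lambda > 0$. Using these two properties, we can
now derive a convergence result for unconstrained optimization
over smooth functions.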

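Theorem 3 (Generalization of Theorem 4 in (Rätsch
et al., 2002)). Let $\mathcal{R}_{emp}$ be a $\lambda$-strongly convex and
$\Lambda$-strongly smooth functional over $L^2(\mathcal{X}, \hat{P})$ space. Let
$\mathcal{H} \subset L^2$ be a restriction set with edge $\gamma$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Given a starting point $f_0$ and
step size $\eta_t = \frac{1}{\Lambda}$, after $T$ iterations of Algorithm 1
we have:

$$\mathcal{R}_{emp}[f_T] - \mathcal{R}_{emp}[f^*] \le \left(1 - \frac{\gamma^2 \lambda}{\Lambda}\right)^T \left(\mathcal{R}_{emp}[f_0] - \mathcal{R}_{emp}[f^*]\right).$$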

The result above holds for the fixed step size $\frac{1}{\Lambda}$ as well
as for step sizes found using a line search along the descent
direction. The analysis uses the strong smoothness
requirement to obtain a quadratic upper bound
on the function and then makes guaranteed progress
by selecting the step size which minimizes this bound,
with larger gains made for larger values of $\gamma$. A complete
proof is provided in Appendix B.
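The rate in Theorem 3 can be checked numerically on a small problem. The sketch below assumes a weighted quadratic objective $\mathcal{R}[f] = \frac{1}{N}\sum_n \frac{a_n}{2}(f_n - y_n)^2$, for which $\lambda = \min_n a_n$ and $\Lambda = \max_n a_n$ under the empirical inner product, and a restriction set of signed coordinate indicators, which has edge $\gamma = 1/\sqrt{N}$. These choices and all variable names are illustrative assumptions.

```python
import math

a = [1.0, 2.0, 3.0, 4.0]        # curvature per training point
y = [1.0, -1.0, 2.0, 0.5]       # minimizer, so R[f*] = 0
N = len(a)
lam, Lam = min(a), max(a)       # strong convexity / smoothness constants
gamma = 1.0 / math.sqrt(N)      # edge of the signed-indicator restriction set

def risk(f):
    return sum(0.5 * ai * (fi - yi) ** 2 for ai, fi, yi in zip(a, f, y)) / N

def grad(f):
    return [ai * (fi - yi) for ai, fi, yi in zip(a, f, y)]

f = [0.0] * N
gap0 = risk(f)
rate = 1.0 - gamma ** 2 * lam / Lam
gaps, bounds = [], []
for t in range(1, 51):
    g = grad(f)
    # Projection (3) onto {+e_i, -e_i}: pick the largest |g_i|.
    i = max(range(N), key=lambda n: abs(g[n]))
    # Algorithm 1 step with eta = 1/Lam; for h = sign(g_i) * e_i the
    # update (<h, g> / ||h||^2) * h reduces to g_i * e_i.
    f[i] -= g[i] / Lam
    gaps.append(risk(f))
    bounds.append(rate ** t * gap0)
```

In this example the observed gap sits well below the bound, as expected for a worst-case guarantee.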

Theorem 3 gives, for strongly smooth objective functionals,
a convergence rate of $O((1 - \frac{\gamma^2 \lambda}{\Lambda})^T)$. This is
very similar to the $O((1 - 4\gamma^2)^{\frac{T}{2}})$ convergence of
AdaBoost (Freund & Schapire, 1997), with both requiring
$O(\log(\frac{1}{\epsilon}))$ iterations to get performance within $\epsilon$ of
optimal. While the AdaBoost result generally provides
tighter bounds, this relatively naive method of gradient
projection is able to obtain reasonably competitive
convergence results while being applicable to a much
wider range of problems. This is expected, as the proposed
method derives no benefit from loss-specific optimizations
and can use a much broader class of weak
learners. This comparison is a common scenario within
optimization: while highly specialized algorithms can
often perform better on specific problems, general solutions
often obtain equally impressive results, albeit
less efficiently, while requiring much less effort to implement.

Unfortunately, the naive approach to restricted gradient
descent breaks down quickly in more general
cases such as non-smooth objectives. Consider the
following example objective over two points $x_1, x_2$:
$\mathcal{R}[f] = 2|f(x_1)| + |f(x_2)|$. Now consider the hypothesis
set $h \in \mathcal{H}$ such that either $h(x_1) \in \{-1, +1\}$ and
$h(x_2) = 0$, or $h(x_1) = 0$ and $h(x_2) \in \{-1, +1\}$. The
algorithm will always select $h^*$ such that $h^*(x_2) = 0$
when projecting gradients from the example objective,
giving a final function with perfect performance on
$x_1$ and arbitrarily poor unchanged performance on $x_2$.
Even if the loss on training point $x_2$ is substantial, the
naive algorithm will not correct it.
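This failure can be reproduced in a few lines. The sketch below runs the naive projection step on the two-point objective above; the starting point, the step size schedule $\eta_t = 0.1/\sqrt{t}$, and the convention of returning $+1$ as the subgradient of $|\cdot|$ at zero are assumptions made for the illustration.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v)) / len(u)

def sgn(v):
    # A valid subgradient choice for |.|: +1 at 0 (any value in [-1, 1] works).
    return 1.0 if v >= 0 else -1.0

def subgrad(f):
    # Subgradient of R[f] = 2|f(x1)| + |f(x2)|.
    return [2.0 * sgn(f[0]), sgn(f[1])]

# Each hypothesis is nonzero on exactly one of the two points.
H = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

f = [0.3, 1.0]
chosen = []
for t in range(1, 2001):
    g = subgrad(f)
    h = max(H, key=lambda c: inner(g, c) / math.sqrt(inner(c, c)))
    coef = (0.1 / math.sqrt(t)) * inner(h, g) / inner(h, h)
    f = [fi - coef * hi for fi, hi in zip(f, h)]
    chosen.append(h)
```

Every selected hypothesis is supported on $x_1$, so the value at $x_2$ never moves from its starting point even though $f(x_1)$ is driven close to zero.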

An algorithm which only ever attempts to project subgradients
of $\mathcal{R}$, such as Algorithm 1, will not be able to
obtain strong performance results for cases like these. 
The algorithms in the next section overcome this ob- 
stacle by projecting modified versions of the subgradi- 
ents of the objective at each iteration. 

4.2. General Convex Optimization 

For the convergence analysis of general convex functions
we now switch to analyzing the average optimality
gap:

$$\frac{1}{T} \sum_{t=1}^{T} \left[ \mathcal{R}[f_t] - \mathcal{R}[f^*] \right],$$

where $f^* = \arg\min_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathcal{R}[f]$ is the fixed hypothesis
which minimizes loss.

By showing that the average optimality gap approaches
0 as $T$ grows large, for decreasing step sizes,
it can be shown that the optimality gap $\mathcal{R}[f_t] - \mathcal{R}[f^*]$
also approaches 0.

This analysis is similar to the standard no-regret online
learning approach, but we restrict our analysis to the
case when $\mathcal{R}_t = \mathcal{R}$. This is because the true online
setting typically involves receiving a new dataset at
every time $t$, and hence a different data distribution $P_t$,
effectively changing the underlying $L^2$ function space
at every time step, making comparison of quantities
at different time steps difficult in the analysis. The
convergence analysis for the online case is beyond the
scope of this paper and is not presented here.

The convergence results to follow are similar to previous
convergence results for the standard gradient descent
setting (Zinkevich, 2003; Hazan et al., 2006), but
with a number of additional error terms due to the gradient
projection step. Sutskever (2009) has previously
studied the convergence of gradient descent with gradient
projection errors using an algorithm similar to
Algorithm 1, but the analysis does not focus on the
weak to strong learning guarantee we seek. In order
to obtain this guarantee we now present two new algorithms.

Algorithm 2 Repeated Gradient Projection Algorithm
Given: starting point $f_0$, step size schedule $\{\eta_t\}_{t=1}^{T}$
for $t = 1, \ldots, T$ do
    Compute subgradient $\nabla_t \in \nabla\mathcal{R}[f_{t-1}]$.
    Let $\nabla' = \nabla_t$, $h^* = 0$.
    for $k = 1, \ldots, t$ do
        Project $\nabla'$ onto hypothesis space $\mathcal{H}$, finding
        nearest direction $h_k^*$.
        $h^* \leftarrow h^* + \frac{\langle h_k^*, \nabla' \rangle}{\|h_k^*\|^2} h_k^*$.
        $\nabla' \leftarrow \nabla' - h_k^*$.
    end for
    Update $f$: $f_t \leftarrow f_{t-1} - \eta_t h^*$.
end for

Our first general convex solution, shown in Algorithm
2, overcomes this issue by using a meta-boosting strategy.
At each iteration $t$, instead of projecting the
gradient $\nabla_t$ onto a single hypothesis $h^*$, we use the
naive algorithm to construct $h^*$ out of a small number
of restricted steps, optimizing over the distance
$\|\nabla_t - h^*\|^2$. By increasing the number of weak learners
trained at each iteration over time, we effectively
decrease the gradient projection error at each iteration.
As the average projection error approaches 0, the
performance of the combined hypothesis approaches
optimal. We now give convergence results for this algorithm
for both strongly convex and convex functionals.
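A minimal sketch of the repeated projection idea follows, applied to the two-point objective $\mathcal{R}[f] = 2|f(x_1)| + |f(x_2)|$ from Section 4.1. It assumes each inner step subtracts the scaled projection from the running residual, so that the residual always equals $\nabla_t - h^*$; the step size schedule, starting point, and subgradient convention at zero are likewise assumptions.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v)) / len(u)

def project(v, hypotheses):
    return max(hypotheses,
               key=lambda h: inner(v, h) / math.sqrt(inner(h, h)))

def repeated_projection(subgrad, f0, hypotheses, T, step):
    f = list(f0)
    for t in range(1, T + 1):
        residual = subgrad(f)               # part of the gradient not yet matched
        h_star = [0.0] * len(f)
        for _ in range(t):                  # t weak learners at iteration t
            h = project(residual, hypotheses)
            c = inner(h, residual) / inner(h, h)
            h_star = [hs + c * hi for hs, hi in zip(h_star, h)]
            residual = [r - c * hi for r, hi in zip(residual, h)]
        f = [fi - step(t) * hs for fi, hs in zip(f, h_star)]
    return f

def sgn(v):
    return 1.0 if v >= 0 else -1.0          # valid subgradient choice at 0

subgrad = lambda f: [2.0 * sgn(f[0]), sgn(f[1])]
H = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

f_rep = repeated_projection(subgrad, [0.3, 1.0], H, 200,
                            lambda t: 0.1 / math.sqrt(t))
```

From the second iteration on, two inner projections already recover the full subgradient here, so the value at $x_2$ is corrected, unlike under the naive algorithm.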

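Theorem 4. Let $\mathcal{R}_{emp}$ be a $\lambda$-strongly convex functional
over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge
$\gamma$. Let $\|\nabla\mathcal{R}[f]\|_P \le G$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$.
Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\lambda t}$, after
$T$ iterations of Algorithm 2 we have:

$$\frac{1}{T} \sum_{t=1}^{T} \left[ \mathcal{R}_{emp}[f_t] - \mathcal{R}_{emp}[f^*] \right] \le \frac{G^2}{\lambda T} \left( 1 + \ln T + \frac{1 - \gamma^2}{\gamma^2} \right).$$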

The proof (Appendix C) relies on the fact that as the
number of iterations increases, our gradient projection
error approaches 0 at the rate given in Theorem 3,
causing the behavior of Algorithm 2 to approach the
standard gradient descent algorithm. The additional
error term in the result is a bound on the geometric
series describing the errors introduced at each time
step.

Algorithm 3 Residual Gradient Projection Algorithm
Given: starting point $f_0$, step size schedule $\{\eta_t\}_{t=1}^{T}$
Let $\Delta = 0$.
for $t = 1, \ldots, T$ do
    Compute subgradient $\nabla_t \in \nabla\mathcal{R}[f_{t-1}]$.
    $\Delta \leftarrow \Delta + \nabla_t$.
    Project $\Delta$ onto hypothesis space $\mathcal{H}$, finding
    nearest direction $h^*$.
    Update $f$: $f_t \leftarrow f_{t-1} - \eta_t \frac{\langle h^*, \Delta \rangle}{\|h^*\|^2} h^*$.
    Update residual: $\Delta \leftarrow \Delta - \frac{\langle h^*, \Delta \rangle}{\|h^*\|^2} h^*$.
end for

Theorem 5. Let $\mathcal{R}_{emp}$ be a convex functional over
$\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$. Let
$\|\nabla\mathcal{R}[f]\|_P \le G$ and $\|f\|_P \le F$ for all $f \in \mathcal{F}$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Given a starting point $f_0$
and step size $\eta_t = \frac{1}{\sqrt{t}}$, after $T$ iterations of Algorithm
2 we have:

Again, the result is similar to the standard gradient
descent result, with an added error term dependent
on the edge $\gamma$.

An alternative version of the repeated projection al- 
gorithm allows for a variable number of weak learners 
to be trained at each iteration. An accuracy thresh- 
old for each gradient projection can be derived given 
a desired accuracy for the final hypothesis, and this 
threshold can be used to train weak learners at each 
iteration until the desired accuracy is reached. 

Algorithm 3 gives a second method for optimizing over 
convex objectives. Like the previous approach, the 
projection error at each time step is used again in pro- 
jection, but a new step is not taken immediately to 
decrease the projection error. Instead, this approach 
keeps track of the residual error left over after projec- 
tion and includes this error in the next projection step. 
This forces the projection steps to eventually account 
for past errors, preventing the possibility of systematic 
error being adversarially introduced through the weak 
learner set. 
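The residual mechanism can be sketched as follows, again on the two-point objective $\mathcal{R}[f] = 2|f(x_1)| + |f(x_2)|$ from Section 4.1. The function names, starting point, step size schedule $\eta_t = 0.1/\sqrt{t}$, and subgradient convention at zero are assumptions for illustration.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v)) / len(u)

def project(v, hypotheses):
    return max(hypotheses,
               key=lambda h: inner(v, h) / math.sqrt(inner(h, h)))

def residual_projection(subgrad, f0, hypotheses, T, step):
    f = list(f0)
    delta = [0.0] * len(f)                  # accumulated, not yet applied gradient
    for t in range(1, T + 1):
        g = subgrad(f)
        delta = [d + gi for d, gi in zip(delta, g)]
        h = project(delta, hypotheses)
        c = inner(h, delta) / inner(h, h)   # projection coefficient
        f = [fi - step(t) * c * hi for fi, hi in zip(f, h)]
        delta = [d - c * hi for d, hi in zip(delta, h)]
    return f

def sgn(v):
    return 1.0 if v >= 0 else -1.0          # valid subgradient choice at 0

subgrad = lambda f: [2.0 * sgn(f[0]), sgn(f[1])]
risk = lambda f: 2.0 * abs(f[0]) + abs(f[1])
H = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

f0 = [0.3, 1.0]
f_res = residual_projection(subgrad, f0, H, 2000,
                            lambda t: 0.1 / math.sqrt(t))
```

Because the unprojected component on $x_2$ keeps accumulating in the residual, it eventually dominates the projection and gets applied, using one weak learner per iteration.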

As with Algorithm 2, we can derive similar convergence
results for strongly-convex and general convex
functionals for this new residual-based algorithm.



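Theorem 6. Let $\mathcal{R}_{emp}$ be a $\lambda$-strongly convex functional
over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge
$\gamma$. Let $\|\nabla\mathcal{R}[f]\|_P \le G$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$.
Let $c = \frac{2}{\gamma^2}$. Given a starting point $f_0$ and step size
$\eta_t = \frac{1}{\lambda t}$, after $T$ iterations of Algorithm 3 we have:

$$\frac{1}{T} \sum_{t=1}^{T} \left[ \mathcal{R}_{emp}[f_t] - \mathcal{R}_{emp}[f^*] \right] \le \frac{2 c^2 G^2}{\lambda T} \left( 1 + \ln T + \frac{2}{T} \right).$$

Theorem 7. Let $\mathcal{R}_{emp}$ be a convex functional over
$\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$. Let
$\|\nabla\mathcal{R}[f]\|_P \le G$ and $\|f\|_P \le F$ for all $f \in \mathcal{F}$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Let $c = \frac{2}{\gamma^2}$. Given a starting
point $f_0$ and step size $\eta_t = \frac{1}{\sqrt{t}}$, after $T$ iterations of
Algorithm 3 we have:

$$\frac{1}{T} \sum_{t=1}^{T} \left[ \mathcal{R}_{emp}[f_t] - \mathcal{R}_{emp}[f^*] \right] \le \frac{F^2}{2\sqrt{T}} + \frac{c^2 G^2}{\sqrt{T}} + \frac{c^2 G^2}{2 T^{\frac{3}{2}}}.$$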

Again, the results are similar bounds to those from the 
non-restricted case. Like the previous proof, the extra 
terms in the bound come from the penalty paid in pro- 
jection errors at each time step, but here the residual 
serves as a mechanism for pushing the error back to 
later projections. The analysis relies on a bound on 
the norm of the residual $\Delta$, derived by observing that
it is increased by at most the norm of the gradient 
and then multiplicatively decreased in projection due 
to the edge requirement. This bound on the size of 
the residual presents itself in the c term present in the 
bound. Complete proofs are presented in Appendix C. 

In terms of efficiency, these two algorithms are similarly
matched. For the strongly convex case, the repeated
projection algorithm uses $O(T^2)$ weak learners
to obtain an average regret $O(\frac{\ln T}{T} + \frac{1}{\gamma^2 T})$, while the
residual algorithm uses $O(T)$ weak learners and has
average regret $O(\frac{\ln T}{\gamma^4 T})$. The major difference lies in the
frequency of gradient evaluation, where the repeated
projection algorithm evaluates the gradient much less
often than the residual algorithm.



5. Experimental Results 

We present preliminary experimental results for these 
new algorithms on three tasks: an imitation learning 
problem, a ranking problem and a set of sample clas- 
sification tasks. 

The first experimental setup is an optimization prob- 
lem which results from the Maximum Margin Planning 
(Ratliff et al., 2009) approach to imitation learning.

Figure 1. Test set loss vs number of weak learners used for
a maximum margin structured imitation learning problem 
for all three restricted gradient algorithms. 



In this setting, a demonstrated policy is provided as 
example behavior and the goal is to learn a cost func- 
tion over features of the environment which produce 
policies with similar behavior. This is done by opti- 
mizing over a convex, non-smooth loss function which 
minimizes the difference in costs between the current 
and demonstrated behavior. Previous attempts in the 
literature have been made to adapt boosting to this 
setting (Ratliff et al., 2009; Bradley, 2009), similar to 
the naive algorithm presented here, but no convergence 
results for this setting are known.

Figure 1 shows the results of running all three of the al- 
gorithms presented here on a sample planning dataset 
from this domain. The weak learners used were neural 
networks with 5 hidden units each. 

The second experimental setting is a ranking task from 
the Microsoft Learning to Rank Datasets, specifically 
MSLR-WEB10K (ms:, 2010), using the ranking version
of the hinge loss and decision stumps as weak
learners. Figure 2 shows the test set disagreement 
(the percentage of violated ranking constraints) plot- 
ted against the number of weak learners. 

As a final test, we ran our boosting algorithms on sev- 
eral multiclass classification tasks from the UCI Ma- 
chine Learning Repository (Frank & Asuncion, 2010), 
using the 'connect4', 'letter', 'pendigits' and 'satimage' 
datasets. All experiments used the multiclass exten- 
sion to the hinge loss (Crammer & Singer, 2002), along 
with multiclass decision stumps for the weak learners. 

Of particular interest are the experiments where the
naive approach to restricted gradient descent clearly
fails to converge ('connect4' and 'letter'). In line
with the presented convergence results, both non-smooth
algorithms approach optimal training performance
at relatively similar rates, while the naive approach
cannot overcome the particular conditions of
these datasets and fails to achieve strong performance.
In these cases, the naive approach repeatedly cycles
through the same weak learners, impeding further optimization
progress.

Figure 2. Test set disagreement (fraction of violated constraints)
vs number of weak learners used for the MSLR-WEB10K
ranking dataset for all three restricted gradient
algorithms.

Figure 3. Performance on multiclass classification experiments over the UCI 'connect4', 'letter', 'pendigits' and 'satimage'
datasets. The algorithms shown are the naive projection (black dashed line), repeated projection steps (red solid line),
and the residual projection algorithm (blue long dashed line).
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A. Equivalence of boosting requirements 

First, we demonstrate that our requirement is equivalent to the AdaBoost style weak learning requirement on 
weak classifiers. 

Theorem 1. For a weak classifier space % with outputs in { — 1,+1}, the following statements are equivalent: 
(1) W has edge 7 for some 7 > 0, and (2) for any non-negative weights w n over training data x n , there is a 
classifier h € H which achieves an error of at most (5 — f ) ^2 n % for some S > 0. 

Proof. To relate the weighted classification setting and our inner product formulation, let weights $w_n = |\nabla(x_n)|$ and labels $y_n = \mathrm{sgn}(\nabla(x_n))$. We examine classifiers $h$ with outputs in $\{-1,+1\}$.

Consider the AdaBoost weak learner requirement re-written as a sum over the correct examples: 

$$\sum_{n,\,h(x_n)=y_n} w_n \;\geq\; \left(\frac{1}{2} + \frac{\delta}{2}\right)\sum_n w_n$$

Breaking the sum over weights into the sum of correct and incorrect weights: 

$$\sum_{n,\,h(x_n)=y_n} w_n \;-\; \sum_{n,\,h(x_n)\neq y_n} w_n \;\geq\; \delta\sum_n w_n$$

The left hand side of this inequality is just $N$ times the inner product $\langle\nabla, h\rangle$, and the right hand side can be re-written as the 1-norm of the weight vector $w$, giving:

$$N\langle\nabla, h\rangle \;\geq\; \delta\|w\|_1 \;\geq\; \delta\|w\|_2$$



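Finally, using $\|h\| = 1$ and $\|\nabla\|^2 = \frac{1}{N}\|w\|_2^2$:

$$\langle\nabla, h\rangle \;\geq\; \frac{\delta}{\sqrt{N}}\,\|\nabla\|\,\|h\|$$

showing that the AdaBoost requirement implies our requirement for edge $\gamma \geq \frac{\delta}{\sqrt{N}} > 0$.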
We can show the converse by starting with our weak learner requirement and expanding: 

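$$\langle\nabla, h\rangle \;\geq\; \gamma\,\|\nabla\|\,\|h\|$$

$$\frac{1}{N}\left(\sum_{n,\,h(x_n)=y_n} w_n \;-\; \sum_{n,\,h(x_n)\neq y_n} w_n\right) \;\geq\; \gamma\,\|\nabla\|$$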

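Then, because $\|\nabla\|^2 = \frac{1}{N}\|w\|_2^2$ and $\|w\|_2 \geq \frac{1}{\sqrt{N}}\|w\|_1$ we get:

$$\sum_{n,\,h(x_n)=y_n} w_n \;-\; \sum_{n,\,h(x_n)\neq y_n} w_n \;\geq\; \gamma\|w\|_1 \;=\; \gamma\sum_n w_n$$

$$\sum_{n,\,h(x_n)=y_n} w_n \;\geq\; \left(\frac{1}{2} + \frac{\gamma}{2}\right)\sum_n w_n$$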

giving the final AdaBoost edge requirement. □ 
In the first part of this proof, the scaling of $\frac{1}{\sqrt{N}}$ shows that our implied edge weakens as the number of data points increases in relation to the AdaBoost style edge requirement, an unfortunate but necessary feature. This weakening is necessary because our notion of strong learning is much more general than other boosting frameworks. In those settings, strong learning only guarantees that any dataset can be classified with 0 training error, while our strong learning guarantee gives optimal performance on any convex loss function.
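
The relationship between the two edge notions can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names, the random gradient, and the hypothetical classifier that agrees with the gradient sign on roughly 70% of the points are all illustrative assumptions. It computes the AdaBoost-style $\delta$ and the inner product edge $\gamma$ for the same classifier under the empirical inner product.

```python
import numpy as np

def inner_product_edge(grad, h):
    """<grad, h> / (||grad|| ||h||) under the empirical (1/N-weighted) inner product."""
    n = grad.size
    inner = np.dot(grad, h) / n
    norm_g = np.sqrt(np.dot(grad, grad) / n)
    norm_h = np.sqrt(np.dot(h, h) / n)
    return inner / (norm_g * norm_h)

def adaboost_delta(grad, h):
    """Largest delta such that correct weight >= (1/2 + delta/2) * total weight."""
    w = np.abs(grad)          # weights w_n = |grad(x_n)|
    y = np.sign(grad)         # labels  y_n = sgn(grad(x_n))
    correct = w[h == y].sum()
    return 2.0 * correct / w.sum() - 1.0

rng = np.random.default_rng(0)
N = 200
grad = rng.normal(size=N)
# Hypothetical weak classifier: flips the correct label on about 30% of points.
flip = rng.random(N) < 0.3
h = np.where(flip, -np.sign(grad), np.sign(grad))

delta = adaboost_delta(grad, h)
gamma = inner_product_edge(grad, h)
```

Under these assumptions, `gamma` should lie between `delta / sqrt(N)` and `delta`, which mirrors the two directions of the proof.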

Theorem 2. For a weak multiclass classifier space $\mathcal{H}$ with outputs in $\{1, \ldots, K\}$, let the modified hypothesis space $\mathcal{H}'$ contain a hypothesis $h' : \mathcal{X} \to \mathbb{R}^K$ for each $h \in \mathcal{H}$ such that $h'(x)_k = 1$ if $h(x) = k$ and $h'(x)_k = -\frac{1}{K-1}$ otherwise. Then, the following statements are equivalent: (1) $\mathcal{H}'$ has edge $\gamma$ for some $\gamma > 0$, and (2) $\mathcal{H}$ satisfies the performance over baseline requirements detailed in Theorem 1 of (Mukherjee & Schapire, 2010).



Proof. In this section we consider the multiclass extension of the previous setting. Instead of a weight vector we now have a matrix of weights $w$ where $w_{nk}$ is the weight or reward for classifying example $x_n$ as class $k$. We can simply let weights $w_{nk} = \nabla(x_n)_k$ and use the same weak learning approach as in (Mukherjee & Schapire, 2010). Given classifiers $h(x)$ which output a label in $\{1, \ldots, K\}$, we convert to an appropriate weak learner for our setting by building a function $h'(x)$ which outputs a vector $y \in \mathbb{R}^K$ such that $y_k = 1$ if $h(x) = k$ and $y_k = -\frac{1}{K-1}$ otherwise.

The equivalent AdaBoost style requirement uses costs $c_{nk} = -w_{nk}$ and minimizes instead of maximizing, but here we state the weight or reward version of the requirement. More details on this setting can be found in (Mukherjee & Schapire, 2010). We also make the additional assumption that $\sum_k w_{nk} = 0, \forall n$ without loss of generality. This assumption is fine as we can take a given weight matrix $w$ and modify each row so it has mean 0, and still have a valid classification matrix as per (Mukherjee & Schapire, 2010). Furthermore, this modification does not affect the edge over random performance of a multiclass classifier under their framework.

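Again consider the multiclass AdaBoost weak learner requirement re-written as a sum of the weights over the predicted class for each example:

$$\sum_n w_{n h(x_n)} \;\geq\; \left(\frac{1}{K} - \frac{\delta}{K}\right)\sum_{n,k} w_{nk} \;+\; \delta\sum_n w_{n y_n}$$

we can then convert the sum over correct labels to the max-norm on weights and multiply through by $\frac{K}{K-1}$:

$$\frac{K}{K-1}\left(\sum_n w_{n h(x_n)} - \frac{1}{K}\sum_{n,k} w_{nk}\right) \;\geq\; \frac{K}{K-1}\,\delta\left(\sum_n \|w_n\|_\infty - \frac{1}{K}\sum_{n,k} w_{nk}\right)$$

by the fact that the correct label $y_n = \arg\max_k w_{nk}$.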

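The left hand side of this inequality is just $N$ times the function space inner product:

$$N\langle\nabla, h'\rangle \;\geq\; \frac{K}{K-1}\,\delta\left(\sum_n \|w_n\|_\infty - \frac{1}{K}\sum_{n,k} w_{nk}\right).$$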



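Using the fact that $\sum_k w_{nk} = 0$ along with $\|\nabla\| \leq \frac{1}{\sqrt{N}}\sum_n \|w_n\|_2$ and $\|h'\| = \sqrt{\frac{K}{K-1}}$ we can now bound the right hand side:

$$N\langle\nabla, h'\rangle \;\geq\; \frac{K}{K-1}\,\delta\sum_n \|w_n\|_\infty \;\geq\; \frac{K}{K-1}\,\frac{\delta}{\sqrt{K}}\sum_n \|w_n\|_2 \;\geq\; \frac{K}{K-1}\,\frac{\delta\sqrt{N}}{\sqrt{K}}\,\|\nabla\|$$

$$\langle\nabla, h'\rangle \;\geq\; \frac{\delta}{\sqrt{N(K-1)}}\,\|\nabla\|\,\|h'\|$$

For $K \geq 2$ we get $\gamma \geq \frac{\delta}{\sqrt{N(K-1)}} > 0$, showing that the existence of the AdaBoost style edge implies the existence of ours. Again, while the requirements are equivalent for some fixed dataset, we see a weakening of the implication as the dataset grows large, an unfortunate consequence of our broader strong learning goals.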
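
The identity relating the function space inner product to the weighted sums, and the norm of the converted hypothesis $h'$, can be verified with a short numerical sketch. This is an illustration under stated assumptions: the helper names are hypothetical, labels are 0-indexed rather than in $\{1, \ldots, K\}$, and the weight matrix and classifier outputs are random.

```python
import numpy as np

def to_vector_hypothesis(labels, K):
    """Rows with 1 at the predicted label and -1/(K-1) in every other position."""
    out = np.full((labels.size, K), -1.0 / (K - 1))
    out[np.arange(labels.size), labels] = 1.0
    return out

def inner(f, g):
    """Empirical inner product: mean over examples of the per-row dot products."""
    return np.sum(f * g) / f.shape[0]

rng = np.random.default_rng(1)
N, K = 50, 4
w = rng.normal(size=(N, K))
w -= w.mean(axis=1, keepdims=True)    # enforce sum_k w_nk = 0 for every row
labels = rng.integers(0, K, size=N)   # arbitrary (hypothetical) classifier outputs
h_prime = to_vector_hypothesis(labels, K)

# N <w, h'> should equal K/(K-1) * (sum_n w_{n,h(x_n)} - (1/K) sum_{n,k} w_nk).
lhs = N * inner(w, h_prime)
picked = w[np.arange(N), labels].sum()
rhs = K / (K - 1) * (picked - w.sum() / K)
norm_h = np.sqrt(inner(h_prime, h_prime))   # should equal sqrt(K/(K-1))
```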

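Now to show the other direction, start with the inner product formulation:

$$\langle\nabla, h'\rangle \;\geq\; \delta\,\|\nabla\|\,\|h'\|$$

$$\frac{1}{N}\,\frac{K}{K-1}\left(\sum_n w_{n h(x_n)} - \frac{1}{K}\sum_{n,k} w_{nk}\right) \;\geq\; \delta\,\|\nabla\|\,\|h'\|$$

Using $\|h'\| = \sqrt{\frac{K}{K-1}}$ and $\|\nabla\| \geq \frac{1}{N}\sum_n \|w_n\|_2$ we can show:

$$\frac{K}{K-1}\left(\sum_n w_{n h(x_n)} - \frac{1}{K}\sum_{n,k} w_{nk}\right) \;\geq\; \delta\sum_n \|w_n\|_2\,\sqrt{\frac{K}{K-1}}$$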

Rearranging we get: 



.A- ' 



n n.fc ' n 



E > ^E^ + v y~j s( ^2 H w «ii2 - ^ E iww 

n n.k n n 

Next, bound the 2- norms using ||w n || 2 > II w n Hi arL d || ifn H2 — ll w «llco anc ^ then rewrite as sums of corre- 
sponding weights to show the multiclass AdaBoost requirement holds: 



E w «»(^) ^ (^- ^=^ )E w "fc + v^i^ l|w " 



$$\sum_n w_{n h(x_n)} \;\geq\; \left(\frac{1}{K} - \frac{\delta}{K}\right)\sum_{n,k} w_{nk} \;+\; \delta\sum_n w_{n y_n}$$



□ 
B. Smooth Convergence Results

For the proofs in this section, all norms and inner products are assumed to be with respect to the empirical 
distribution P. 

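Theorem 3. Let $\mathcal{R}_{emp}$ be a $\lambda$-strongly convex and $\Lambda$-strongly smooth functional over $L^2(\mathcal{X}, P)$ space. Let $\mathcal{H} \subset L^2$ be a restriction set with edge $\gamma$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\Lambda}$, after $T$ iterations of Algorithm 1 we have:

$$\mathcal{R}_{emp}[f_T] - \mathcal{R}_{emp}[f^*] \;\leq\; \left(1 - \frac{\gamma^2\lambda}{\Lambda}\right)^T \left(\mathcal{R}_{emp}[f_0] - \mathcal{R}_{emp}[f^*]\right)$$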



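Proof. Starting with the definition of strong smoothness, and examining the objective value at time $t + 1$ we have:

$$\mathcal{R}[f_{t+1}] \;\leq\; \mathcal{R}[f_t] + \langle\nabla\mathcal{R}[f_t], f_{t+1} - f_t\rangle + \frac{\Lambda}{2}\|f_{t+1} - f_t\|^2$$

Then, using $f_{t+1} = f_t - \frac{1}{\Lambda}\frac{\langle\nabla\mathcal{R}[f_t], h_t\rangle}{\|h_t\|^2}h_t$ we get:

$$\mathcal{R}[f_{t+1}] \;\leq\; \mathcal{R}[f_t] - \frac{1}{2\Lambda}\frac{\langle\nabla\mathcal{R}[f_t], h_t\rangle^2}{\|h_t\|^2}$$

Subtracting the optimal value from both sides and applying the edge requirement we get:

$$\mathcal{R}[f_{t+1}] - \mathcal{R}[f^*] \;\leq\; \mathcal{R}[f_t] - \mathcal{R}[f^*] - \frac{\gamma^2}{2\Lambda}\|\nabla\mathcal{R}[f_t]\|^2$$

From the definition of strong convexity we know $\|\nabla\mathcal{R}[f_t]\|^2 \geq 2\lambda(\mathcal{R}[f_t] - \mathcal{R}[f^*])$ where $f^*$ is the minimum point. Rearranging we can conclude that:

$$\mathcal{R}[f_{t+1}] - \mathcal{R}[f^*] \;\leq\; \left(\mathcal{R}[f_t] - \mathcal{R}[f^*]\right)\left(1 - \frac{\gamma^2\lambda}{\Lambda}\right)$$

Recursively applying the above bound starting at $t = 0$ gives the final bound on $\mathcal{R}[f_T] - \mathcal{R}[f^*]$. □
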
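
The linear rate above can be illustrated on a small problem. The sketch below is an assumption-laden example rather than the paper's Algorithm 1: it uses the plain Euclidean inner product on $\mathbb{R}^d$ instead of the empirical one, a quadratic objective whose Hessian eigenvalues lie in $[\lambda, \Lambda]$, and the signed coordinate vectors $\{\pm e_i\}$ as the restriction set, which has edge at least $1/\sqrt{d}$. All names are hypothetical.

```python
import numpy as np

def restricted_descent(A, b, f0, steps, smooth):
    """Run `steps` greedy updates f <- f - (1/smooth) * (<g, h> / ||h||^2) * h,
    where h is the signed coordinate vector best aligned with the gradient g."""
    f = f0.copy()
    values = [0.5 * f @ A @ f - b @ f]
    for _ in range(steps):
        g = A @ f - b                   # gradient of the quadratic objective
        i = int(np.argmax(np.abs(g)))   # best direction in the set {+/- e_i}
        f[i] -= g[i] / smooth           # <g, e_i> / ||e_i||^2 = g[i]
        values.append(0.5 * f @ A @ f - b @ f)
    return np.array(values)

rng = np.random.default_rng(0)
d, T = 10, 200
lam, Lam = 1.0, 5.0                     # strong convexity / smoothness constants
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag(np.linspace(lam, Lam, d)) @ Q.T   # eigenvalues in [lam, Lam]
b = rng.normal(size=d)
f_star = np.linalg.solve(A, b)
r_star = 0.5 * f_star @ A @ f_star - b @ f_star

values = restricted_descent(A, b, np.zeros(d), T, Lam)
gamma = 1.0 / np.sqrt(d)                # edge of the coordinate restriction set
gap = values - r_star
bound = (1.0 - gamma**2 * lam / Lam) ** np.arange(T + 1) * gap[0]
```

Under these assumptions every entry of `gap` should stay below the corresponding entry of `bound`, matching the $(1 - \gamma^2\lambda/\Lambda)^T$ contraction.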
C. General Convergence Results 

For the proofs in this section, all norms and inner products are assumed to be with respect to the empirical 
distribution P. 

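Theorem 4. Let $\mathcal{R}_{emp}$ be a $\lambda$-strongly convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$. Let $\|\nabla\mathcal{R}[f]\|_P \leq G$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Given a starting point $f_0$ and step size $\eta_t = \frac{2}{\lambda t}$, after $T$ iterations of Algorithm 2 we have: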



2 1 „,2 



1-7* 



f — 1 ' 



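Proof. First, we start by bounding the potential $\|f_t - f^*\|^2$, similar to the potential function arguments in (Zinkevich, 2003; Hazan et al., 2006), but with a different descent step:

$$\|f_{t+1} - f^*\|^2 \;\leq\; \|f_t - \eta_t h_t - f^*\|^2 \;=\; \|f_t - f^*\|^2 + \eta_t^2\|h_t\|^2 - 2\eta_t\langle f_t - f^*, h_t - \nabla_t\rangle - 2\eta_t\langle f_t - f^*, \nabla_t\rangle$$

$$\langle f^* - f_t, \nabla_t\rangle \;\geq\; \frac{1}{2\eta_t}\|f_{t+1} - f^*\|^2 - \frac{1}{2\eta_t}\|f_t - f^*\|^2 - \frac{\eta_t}{2}\|h_t\|^2 - \langle f^* - f_t, h_t - \nabla_t\rangle$$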
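
The algebra behind the potential expansion can be sanity-checked numerically. This is a minimal sketch under stated assumptions: functions are represented as random vectors in $\mathbb{R}^d$ with the Euclidean inner product, the step is the unprojected update $f_{t+1} = f_t - \eta_t h_t$ (so both relations hold with equality), and all variable names are hypothetical.

```python
import numpy as np

def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return float(v @ v)

rng = np.random.default_rng(2)
d, eta = 8, 0.1
f_t, f_star, h_t, grad_t = rng.normal(size=(4, d))
f_next = f_t - eta * h_t   # unprojected step, so the expansion is an equality

# ||f_t - eta*h_t - f*||^2 expanded with h_t split into (h_t - grad_t) + grad_t.
expansion = (sq_norm(f_t - f_star) + eta**2 * sq_norm(h_t)
             - 2 * eta * (f_t - f_star) @ (h_t - grad_t)
             - 2 * eta * (f_t - f_star) @ grad_t)

# Rearranged form isolating <f* - f_t, grad_t>.
lhs = (f_star - f_t) @ grad_t
rhs = (sq_norm(f_next - f_star) / (2 * eta)
       - sq_norm(f_t - f_star) / (2 * eta)
       - eta / 2 * sq_norm(h_t)
       - (f_star - f_t) @ (h_t - grad_t))
```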
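Using the definition of strong convexity and summing:

$$\sum_{t=1}^T \mathcal{R}[f^*] \;\geq\; \sum_{t=1}^T \mathcal{R}[f_t] + \sum_{t=1}^T \langle f^* - f_t, \nabla_t\rangle + \sum_{t=1}^T \frac{\lambda}{2}\|f^* - f_t\|^2$$

$$\geq\; \sum_{t=1}^T \mathcal{R}[f_t] - \frac{1}{2\eta_1}\|f_1 - f^*\|^2 + \sum_{t=1}^{T-1} \frac{1}{2}\|f_{t+1} - f^*\|^2\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t+1}} + \lambda\right) - \sum_{t=1}^T \frac{\eta_t}{2}\|h_t\|^2 - \sum_{t=1}^T \langle f^* - f_t, h_t - \nabla_t\rangle$$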



Setting r? t = 4 and use bound \\h t \\ < 2||V t || < 2G 



Ew] > E^i/t] - ^ E ^ - 1 E(ii/« - n\ 2 E (/* - - v *» 
t=i *=i {=1 {=1 t=i 



2 



T AC 2 1 T 

> E^t/*] ~ -ir (! + lnT ) - a E I'' 1 * - v < 
t=i ' t=i 



Using the result from 3 we can bound the error at each step t: 



, _ G 2 -' 



E^[/i - E 7 ^] - -^r-( 1+lnT ) - -r E(! - 

{=1 t=l " t=l 



A A 7 2 

giving the final bound. □ 



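Theorem 5. Let $\mathcal{R}_{emp}$ be a convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$. Let $\|\nabla\mathcal{R}[f]\|_P \leq G$ and $\|f\|_P \leq F$ for all $f \in \mathcal{F}$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\sqrt{t}}$, after $T$ iterations of Algorithm 2 we have:

$$\frac{1}{T}\sum_{t=1}^T \left[\mathcal{R}_{emp}[f_t] - \mathcal{R}_{emp}[f^*]\right] \;\leq\; \frac{F^2}{2\sqrt{T}} + \frac{G^2}{\sqrt{T}} + 2FG\,\frac{1-\gamma^2}{2\gamma^2}$$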



Proof. Like the last proof, we start with the altered potential and sum over the definition of convexity: 

T T T— 1 

E^t/i >!>[/*] - 1 n /i - ^n 2 + E Jn/'+i - /i a (- — )- 

T T 

Ef ini'-E^*--^-^) 
t=i t=i 
Setting $\eta_t = \frac{1}{\sqrt{t}}$ and using bound $\|h_t\| \leq \|\nabla_t\| \leq G$ and the result from 3 we can bound the error at each step $t$:

f>[/*] > f>[jy - -Jhi/t - r ii 2 - %- E - E </* - ^ - v *> 
t=i {=1 /t t=i vr t=i 

t=l t=l 

> E nh\ ^f- g 2 Vt 2fg 1 -^- 

t=l ' ^ 

giving the final bound. □ 

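Theorem 6. Let $\mathcal{R}_{emp}$ be a $\lambda$-strongly convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$. Let $\|\nabla\mathcal{R}[f]\|_P \leq G$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Let $c = \frac{2}{\gamma^2}$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\lambda t}$, after $T$ iterations of Algorithm 3 we have: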

i Y}n[f t ] - n emp [r]\ < + inT + 1). 
t=i 

Proof. Like the proof of Theorem 4, we again use a potential function and sum over the definition of convexity: 

T T T—l 

$>[/i >2>[/ t ] - i-iiA - r 11 2 + E |n/w - n 2 £ ^- + A )- 
*=i t=i 771 t=i z 7?t+i 

T T T-l 

E f imi 2 - E </* - /*. ^ - ( A * + v *)> - E </* - A *+i) 

t=l f=l t=0 

>EK[/.]-^i/,-/Y + | 1 i|i/. + ,-rii^-^ + A)- 

T T T-l T-l 

E f ini 2 - E </* - /*. ^ - ( A * + v *)) - E (f - /*> A *+o - E A *+i) 

t=l t=l i=0 t=0 

where $h_t$ is the augmented step taken in Algorithm 3.

Setting $\eta_t = \frac{1}{\lambda t}$ and using the bound $\|h_t\| \leq \|\nabla_t\| \leq G$, along with $\Delta_{t+1} = (\Delta_t + \nabla_t) - h_t$:



it 

X>[/i > j>[/ t ] Efii^n 2 - «/* - /t+i, a 4+1 ) - ^nr - /t+iIi 2 ) - E A *+i) 



t=i *=i *=i 



We can bound the norm of $\Delta_t$ by considering that (a) it starts at 0 and (b) at each time step it increases by at most $\|\nabla_t\| \leq G$ and is multiplied by $\sqrt{1-\gamma^2}$. This implies that $\|\Delta_t\| \leq cG$ where $c = \frac{\sqrt{1-\gamma^2}}{1-\sqrt{1-\gamma^2}} < \frac{2}{\gamma^2}$.
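
The bound on the residual norm follows from a geometric recursion, which the following sketch iterates directly. It assumes the worst case described above: at each step the norm grows by exactly $G$ and is then scaled by $\sqrt{1-\gamma^2}$. The function name and the values of $\gamma$ and $G$ are illustrative assumptions.

```python
import math

def residual_norm_history(gamma, G, steps):
    """Iterate a_{t+1} = sqrt(1 - gamma^2) * (a_t + G) starting from a_0 = 0."""
    rho = math.sqrt(1.0 - gamma**2)
    a = 0.0
    history = []
    for _ in range(steps):
        a = rho * (a + G)   # worst-case growth by G, then contraction by rho
        history.append(a)
    return history

gamma, G = 0.3, 2.0
rho = math.sqrt(1.0 - gamma**2)
c = rho / (1.0 - rho)       # fixed point of the recursion, in units of G
history = residual_norm_history(gamma, G, 500)
```

The iterates increase toward the fixed point `c * G` without exceeding it, and `c` itself stays below `2 / gamma**2`.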

From here we can get a final bound: 

t=i t=i 



at 



□ 



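Theorem 7. Let $\mathcal{R}_{emp}$ be a convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$. Let $\|\nabla\mathcal{R}[f]\|_P \leq G$ and $\|f\|_P \leq F$ for all $f \in \mathcal{F}$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{emp}[f]$. Let $c = \frac{2}{\gamma^2}$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\sqrt{t}}$, after $T$ iterations of Algorithm 3 we have:

$$\frac{1}{T}\sum_{t=1}^T \left[\mathcal{R}_{emp}[f_t] - \mathcal{R}_{emp}[f^*]\right] \;\leq\; \frac{F^2}{2\sqrt{T}} + \frac{c^2G^2}{\sqrt{T}} + \frac{c^2G^2}{2T^{3/2}}$$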



Proof. Similar to the last few proofs, we get a result similar to the standard gradient version, with the error 
term from the last proof: 



T— 1 

1 11 



i>LT] ^E^l ~ ;r IIA ~ ^H 2 + E lllA+i - /T(_ , 
t =i t=i m t=i z 774 %+1 



E f ini 2 - «r - h+u A w > - ^nr - /T+iii 2 ) - £ A '+i 
*=i t=i 



Using the bound on ||A t || < c from above and setting r\ t 



2VT 

L— 1. t ± 

giving the final bound. □ 



